Abstract

Some statistical models are specified via a data generating process for 
which the likelihood function cannot be computed in closed form. Standard 
likelihood-based inference is then not feasible but the model parameters can 
be inferred by finding the values which yield simulated data that resemble 
the observed data. This approach faces at least two major difficulties: The 
first difficulty is the choice of the discrepancy measure which is used to 
judge whether the simulated data resemble the observed data. The second 
difficulty is the computationally efficient identification of regions in the
parameter space where the discrepancy is low. We give here an introduction
to our recent work where we tackle the two difficulties through classification 
and Bayesian optimization. 


1 Introduction 

The likelihood function plays a central role in statistics and machine learning. It is the joint 
probability of the observed data $X$ seen as a function of the model parameters of interest
$\theta$. We may assume that the data $X$ are a realization of a two-stage sampling process,

$$Z \sim p_1(Z \mid \theta^\circ), \qquad X \sim p_2(X \mid Z, \theta^\circ), \tag{1}$$

where $Z$ are unobserved variables and $\theta^\circ$ are some fixed but unknown values of the model
parameters. The likelihood function $L(\theta)$ is implicitly defined via an integral,

$$L(\theta) = p(X \mid \theta) = \int p_2(X \mid z, \theta)\, p_1(z \mid \theta)\, dz. \tag{2}$$

For many realistic data generating processes, the integral cannot be computed analytically
in closed form, and numerical approximation is computationally too costly as well. Standard
likelihood-based inference is then not feasible. But inference can still be performed by
exploiting the possibility to simulate data from the model. Such simulation-based
likelihood-free inference methods have emerged in multiple disciplines: “indirect inference”
originated in economics (Gourieroux et al., 1993), “approximate Bayesian computation” (ABC)
in genetics (Beaumont et al., 2002; Marjoram et al., 2003; Sisson et al., 2007), and the “synthetic
likelihood” approach in ecology (Wood, 2010). The different methods share the basic idea
of identifying the model parameters by finding values which yield simulated data that resemble
the observed data. The inference process is shown schematically in Algorithm 1 in the
framework of ABC.

Result: N samples $\theta_1, \ldots, \theta_N$
1 for i = 1 to N do
2     repeat
3         Propose parameter $\theta$    (affects the speed of the inference)
4         Generate pseudo observed data $Y_\theta \sim p(Y_\theta \mid \theta)$
5     until $X \approx Y_\theta$    (affects the quality of the inference)
6     set $\theta_i = \theta$
7 end

Algorithm 1: Basic ABC algorithm.
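
The structure of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited papers: the simulator (`simulate`, a unit-variance Gaussian with unknown mean), the discrepancy (`discrepancy`, the absolute difference of sample means), the uniform proposal on [-5, 5], the threshold `eps`, and the simulation budget `max_tries` are all hypothetical choices made only for this example.

```python
import numpy as np

def simulate(theta, n, rng):
    """Hypothetical simulator: n draws from a unit-variance Gaussian with mean theta."""
    return rng.normal(loc=theta, scale=1.0, size=n)

def discrepancy(x, y):
    """Illustrative discrepancy (an assumption): absolute difference of sample means."""
    return abs(x.mean() - y.mean())

def rejection_abc(x, n_samples, eps, rng, max_tries=100_000):
    """Basic ABC rejection sampler mirroring the loop structure of Algorithm 1."""
    accepted = []
    tries = 0
    for _ in range(n_samples):
        while True:
            tries += 1
            if tries > max_tries:
                raise RuntimeError("simulation budget exhausted")
            theta = rng.uniform(-5.0, 5.0)      # line 3: propose (uniform prior, assumed)
            y = simulate(theta, x.size, rng)    # line 4: generate pseudo observed data
            if discrepancy(x, y) < eps:         # line 5: accept if X is close to Y_theta
                break
        accepted.append(theta)                  # line 6: store the accepted parameter
    return np.array(accepted), tries

rng = np.random.default_rng(0)
x_obs = simulate(0.0, 100, rng)                 # "observed" data with true mean 0
samples, n_sims = rejection_abc(x_obs, n_samples=200, eps=0.2, rng=rng)
print(f"accepted {samples.size} of {n_sims} proposals, "
      f"mean of accepted parameters = {samples.mean():.3f}")
```

Most proposals are rejected in this sketch because the proposal distribution ignores where the discrepancy is small, which is the computational difficulty associated with line 3.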

In Algorithm 1, two fundamental difficulties of the aforementioned inference methods are 
highlighted. One difficulty is the measurement of similarity, or discrepancy, between the 
observed data $X$ and the simulated data $Y_\theta$ (line 5). The choice of discrepancy measure
affects the statistical quality of the inference process. The second difficulty is of
computational nature. Since simulating data $Y_\theta$ can be computationally very costly, one would like
to identify the region in the parameter space where the simulated data resemble the
observed data as quickly as possible, without proposing parameters $\theta$ which have a negligible
chance of being accepted (line 3).

We have been working on both problems, the choice of the discrepancy measure 
and the fast identification of the parameter regions of interest (Gutmann et al., 2014;
Gutmann and Corander, 2015). The following two sections are a brief introduction to the
two papers. 

2 Discriminability as discrepancy measure 

We transformed the original problem of measuring the discrepancy between $Y_\theta$ and $X$ into
a problem of classifying the data into simulated versus observed (Gutmann et al., 2014).
Intuitively, it is easier to discriminate between two data sets which are very different than
between data which are similar, and when the two data sets are generated with the same
parameter values, the classification task cannot be solved significantly above chance level.
This motivated us to use the discriminability (classifiability) as discrepancy measure, and
to perform likelihood-free inference by identifying the parameter values which yield only
chance-level discriminability (Gutmann et al., 2014).

We next illustrate this approach using a toy example. The data $X = (x_1, \ldots, x_n)$ are
assumed to be sampled from a standard normal distribution (black curve in Figure 1(a)), and
the parameter of interest $\theta$ is the mean. For data simulated with mean $\theta = 6$ (green curve),
the two densities barely overlap so that classification is easy. In fact, linear discriminant
analysis (LDA) yields a discriminability of almost 100% (Figure 1(b), green dashed curve).
If the data are simulated with a mean closer to zero, for example with $\theta = 1/2$ (red curve),
the simulated data $Y_\theta$ become more similar to $X$ and the classification accuracy drops to
around 60% (red dashed curve). For $\theta = 0$, where the simulated and observed data are
generated with the same value of $\theta$, only chance-level discriminability of 50% is obtained.
This illustrates how discriminability can be used as a discrepancy measure.
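
The toy example can be reproduced approximately with a few lines of Python. The sketch below is an illustration under assumptions, not the code behind Figure 1: the function name `lda_discriminability`, the half/half split into training and held-out data, and the random seed are hypothetical choices. It relies on the fact that, in one dimension with equal class sizes and a shared variance, the LDA decision boundary is the midpoint between the two class means.

```python
import numpy as np

def lda_discriminability(x, y, rng):
    """Held-out accuracy of 1-D LDA separating x (observed) from y (simulated).

    Assumes x and y have the same size, so that both classes have equal priors
    and the LDA boundary reduces to the midpoint between the class means.
    """
    x = rng.permutation(x)
    y = rng.permutation(y)
    half = x.size // 2
    x_tr, x_te = x[:half], x[half:]          # first half trains, second half tests
    y_tr, y_te = y[:half], y[half:]
    m_x, m_y = x_tr.mean(), y_tr.mean()
    boundary = 0.5 * (m_x + m_y)
    if m_y >= m_x:                           # simulated class lies to the right
        correct = np.sum(x_te < boundary) + np.sum(y_te >= boundary)
    else:                                    # simulated class lies to the left
        correct = np.sum(x_te >= boundary) + np.sum(y_te < boundary)
    return correct / (x_te.size + y_te.size)

rng = np.random.default_rng(0)
n = 10_000
x_obs = rng.normal(0.0, 1.0, size=n)         # observed data: standard normal
acc = {theta: lda_discriminability(x_obs, rng.normal(theta, 1.0, size=n), rng)
       for theta in (6.0, 0.5, 0.0)}
for theta, a in acc.items():
    print(f"theta = {theta}: discriminability = {a:.3f}")
```

For two unit-variance Gaussians whose means differ by $\theta$, the midpoint rule has a population accuracy of $\Phi(|\theta|/2)$, where $\Phi$ is the standard normal distribution function. This gives about 0.999 for $\theta = 6$ and about 0.60 for $\theta = 1/2$, in line with the values quoted above.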

We analyzed the validity of this approach theoretically and demonstrated it on more
challenging synthetic data as well as real data with an individual-based epidemic model for
bacterial infections in day care centers (Gutmann et al., 2014). The finding that
classification can be used to measure the discrepancy has both practical and theoretical value: The
main practical value is that the rather difficult problem of choosing a discrepancy measure
is reduced to a more standard problem where we can leverage effective existing solutions.
The theoretical value lies in the establishment of a tight connection between likelihood-free
inference and classification, two fields of research which appear rather different at first
glance.

[Figure 1 panels: (a) Gaussian densities; (b) Classification performance of LDA]

Figure 1: Discriminability as discrepancy measure, illustration on toy data (n = 10,000).


3 Bayesian optimization to identify parameter regions of interest 

In the following, we denote a given discrepancy measure by $\Delta_\theta$. A small value of $\Delta_\theta$ is
assumed to imply that $Y_\theta$ is judged to be similar to $X$. The difficulty in finding parameter
regions where $\Delta_\theta$ is small is at least twofold: First, the mapping from $\theta$ to $\Delta_\theta$ can generally
not be expressed in closed form and derivatives are not available either. Second, $\Delta_\theta$ is
actually a stochastic process due to the use of simulations to obtain $Y_\theta$. We illustrate
this in Figure 2 for our Gaussian toy example where $\Delta_\theta$ is the discriminability between $X$
and $Y_\theta$ (for further examples, see Gutmann and Corander, 2015). The figure visualizes the
distribution of $\Delta_\theta$ for $n = 50$. The fact that $\Delta_\theta$ is a random process was suppressed in
Figure 1 by working with a large sample size.
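
The randomness of the discrepancy at small sample sizes can be made visible by evaluating it repeatedly at fixed parameter values. The following sketch is an assumption-laden illustration rather than the procedure used for Figure 2: the helper `discriminability` (held-out accuracy of a midpoint classifier), the parameter grid, the number of repetitions `reps`, and the seed are all hypothetical.

```python
import numpy as np

def discriminability(x, y, rng):
    """Held-out accuracy of a midpoint (1-D LDA) classifier; x and y have equal size."""
    x, y = rng.permutation(x), rng.permutation(y)
    h = x.size // 2
    m_x, m_y = x[:h].mean(), y[:h].mean()
    b = 0.5 * (m_x + m_y)                       # decision boundary from training halves
    s = 1.0 if m_y >= m_x else -1.0             # orientation of the two classes
    correct = np.sum(s * (x[h:] - b) < 0) + np.sum(s * (y[h:] - b) >= 0)
    return correct / (2 * (x.size - h))

rng = np.random.default_rng(2)
n, reps = 50, 500
x_obs = rng.normal(0.0, 1.0, size=n)            # fixed observed data
thetas = np.array([0.0, 1.0, 2.0, 4.0])

# Each row holds `reps` independent evaluations of the discrepancy at one theta.
vals = np.array([[discriminability(x_obs, rng.normal(t, 1.0, size=n), rng)
                  for _ in range(reps)] for t in thetas])
mean, std = vals.mean(axis=1), vals.std(axis=1)
for t, m, s in zip(thetas, mean, std):
    print(f"theta = {t:.1f}: mean = {m:.3f}, std = {s:.3f}")
```

The spread across repetitions is what any optimizer of the discrepancy has to cope with: a single evaluation at a given parameter value is only a noisy indication of the typical discrepancy there.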

We used Bayesian optimization, a combination of nonlinear (Gaussian process) regression
and optimization (see, for example, Brochu et al., 2010), to quickly identify regions where
$\Delta_\theta$ is likely to be small (Gutmann and Corander, 2015). In Bayesian optimization, the
available information $\{(\theta^{(k)}, \Delta_\theta^{(k)}),\ k = 1, \ldots, K\}$ about the relation between $\theta$ and $\Delta_\theta$ is
used to build a statistical model of $\Delta_\theta$, and new data are actively acquired in regions where
the minimum of $\Delta_\theta$ is potentially located. After acquisition of the new data, e.g. a tuple
$(\theta^{(K+1)}, \Delta_\theta^{(K+1)})$, the model is updated using Bayes’ theorem.
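
A generic version of this loop can be sketched with NumPy alone. The code below is a simplified illustration and not the method of Gutmann and Corander (2015): the squared-exponential kernel, the fixed hyperparameters `SIGNAL_VAR`, `NOISE_VAR` and `LENGTH_SCALE`, the constant prior mean, the lower-confidence-bound acquisition rule, the grid search over [-5, 5], and the helper names are all assumptions made for this example.

```python
import numpy as np

SIGNAL_VAR, NOISE_VAR, LENGTH_SCALE = 0.05, 0.005, 1.0   # assumed, not fitted

def discrepancy(theta, x, rng):
    """Noisy discrepancy: held-out accuracy of a midpoint (1-D LDA) classifier."""
    y = rng.normal(theta, 1.0, size=x.size)
    xp = rng.permutation(x)
    h = x.size // 2
    m_x, m_y = xp[:h].mean(), y[:h].mean()
    b = 0.5 * (m_x + m_y)
    s = 1.0 if m_y >= m_x else -1.0
    correct = np.sum(s * (xp[h:] - b) < 0) + np.sum(s * (y[h:] - b) >= 0)
    return correct / (2 * (x.size - h))

def kernel(a, b):
    """Squared-exponential covariance between two 1-D arrays of parameters."""
    d = a[:, None] - b[None, :]
    return SIGNAL_VAR * np.exp(-0.5 * (d / LENGTH_SCALE) ** 2)

def gp_posterior(t, d, grid):
    """Posterior mean and standard deviation of the GP model on the grid."""
    m0 = d.mean()                                        # constant prior mean
    L = np.linalg.cholesky(kernel(t, t) + NOISE_VAR * np.eye(t.size))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, d - m0))
    k_s = kernel(t, grid)                                # shape (len(t), len(grid))
    mean = m0 + k_s.T @ alpha
    v = np.linalg.solve(L, k_s)
    std = np.sqrt(np.maximum(SIGNAL_VAR - np.sum(v ** 2, axis=0), 0.0))
    return mean, std

rng = np.random.default_rng(1)
x_obs = rng.normal(0.0, 1.0, size=50)
grid = np.linspace(-5.0, 5.0, 201)

thetas = list(rng.uniform(-5.0, 5.0, size=3))            # initial design
deltas = [discrepancy(t, x_obs, rng) for t in thetas]

for _ in range(20):                                      # active acquisitions
    mean, std = gp_posterior(np.array(thetas), np.array(deltas), grid)
    theta_next = grid[np.argmin(mean - 2.0 * std)]       # lower confidence bound
    thetas.append(theta_next)
    deltas.append(discrepancy(theta_next, x_obs, rng))   # new tuple (theta, delta)

mean, std = gp_posterior(np.array(thetas), np.array(deltas), grid)
theta_hat = grid[np.argmin(mean)]
print(f"estimated location of the minimum: {theta_hat:.2f}")
```

The acquisition rule trades off evaluating where the model predicts a small discrepancy against evaluating where the model is still uncertain. Recomputing the posterior after each new tuple plays the role of the Bayesian update described above.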

For our simple toy example, the region around zero was identified as the region of interest
within ten acquisitions (Figure 3(a-e)). While the location of the minimum is approximately
correct, the posterior mean approximates the (empirical) mean of $\Delta_\theta$ in Figure 2 only
roughly. As more evidence about the behavior of $\Delta_\theta$ in the region of interest is acquired,
the fit improves (Figure 3(f)).

In the full paper (Gutmann and Corander, 2015), we show that Bayesian optimization not
only allows us to quickly identify the regions of interest but also to perform approximate
posterior inference. Our findings are supported by theory and by applications to real data
analysis with intractable models. In our applications, the inference was accelerated through
a reduction in the number of required simulations by several orders of magnitude.


4 Conclusions 

Two major difficulties in likelihood-free inference are the choice of the discrepancy
measure between simulated and observed data, and the identification of regions in the
parameter space where the discrepancy is likely to be small. The former difficulty is more
of statistical, the latter more of computational nature. We gave a brief introduction to
our recent work on the two issues: We used classification to measure the discrepancy
(Gutmann et al., 2014), and Bayesian optimization to quickly identify regions of low
discrepancy (Gutmann and Corander, 2015).

Figure 2: Distribution of $\Delta_\theta$ for the Gaussian toy example using n = 50.

[Figure 3 panels: (e) Model after ten acquisitions; (f) Model after twenty acquisitions]

Figure 3: Bayesian optimization to quickly identify the parameter regions of interest.